Abstract

We propose an effective approach for spatio-temporal
action localization in realistic videos. The approach first
detects proposals at the frame-level and scores them with
a combination of static and motion CNN features. It then
tracks high-scoring proposals throughout the video using
a tracking-by-detection approach. Our tracker relies
simultaneously on instance-level and class-level detectors.
The tracks are scored using a spatio-temporal motion
histogram, a descriptor at the track level, in combination with
the CNN features. Finally, we perform temporal localization
of the action using a sliding-window approach at the
track level. We present experimental results for
spatio-temporal localization on the UCF-Sports, J-HMDB and
UCF-101 action localization datasets, where our approach
outperforms the state of the art with a margin of 15%, 7%
and 12% respectively in mAP.

1. Introduction

Recent work on action recognition mostly focuses on the 
problem of action classification [22, 33, 43]. The goal is to 
assign a category label to a video, in most cases cropped to 
the extent of the action. In a long stream of video, an action 
may have varying temporal extent. Furthermore, the action 
is also spatially localized. Yet, detecting an action in space
and time remains a challenging task that has received little
attention so far.

Some previous works address related issues by putting
emphasis either on the spatial or on the temporal localization.
Action recognition and localization in still images [3]
is an extreme example along the first line, where local
detectors are trained, e.g., with HOG features and spatially
localize the person and/or the object. At the other extreme,
recent work on action recognition and localization from
videos [7, 19, 31] performs temporal localization, for which
dense motion features such as dense trajectories [43] proved
effective.

Several recent works address spatial and temporal localization
jointly. They resort to figure-centric models [27, 34,
23], discriminative parts [30] or proposals [15, 46, 11].
Proposals are obtained by hierarchical merging of supervoxels [15],
by maximizing an actionness score [46] or by relying
on selective search regions and CNN features [11].

The main challenge in spatio-temporal localization is to
accommodate the uncertainty of per-frame spatial localization
and the temporal consistency. If the spatial localization
performed independently on each frame is too selective
and at the same time uncertain, then enforcing the temporal
consistency of the localization may fail. Here we use
proposals to obtain a set of per-frame spatial proposals and
enforce temporal consistency based on a tracker that
simultaneously relies on instance-level and class-level detectors.

Our approach starts from frame-level proposals extracted
with a high-recall proposal algorithm [47]. Proposals are
scored using CNN descriptors based on appearance and
motion information [11]. To ensure the temporal consistency,
we propose to track them with a tracking-by-detection
approach combining an instance-level and class-level detector.
We then score the tracks with the CNN features as well as
a spatio-temporal motion histogram descriptor, which
captures the dynamics of an action. At this stage, the tracks
are localized in space, but the temporal localization needs
to be determined. Temporal localization is performed using
a multi-scale sliding-window approach at the track level.

In summary, this paper introduces an approach for
spatio-temporal localization, by learning to track, with
state-of-the-art experimental results on UCF-Sports,
J-HMDB and UCF-101. A spatio-temporal local descriptor
allows us to single out more relevant tracks and temporally
localize the action at the track level.

This paper is organized as follows. In Sec. 2, we
review related work on action localization. We then present
an overview of our approach in Sec. 3 and give the details
in Sec. 4. Finally, Sec. 5 presents experimental results.

2. Related work 

Most approaches for action recognition focus on action
classification [1, 33]. Feature representations such as
bag-of-words on space-time descriptors have shown excellent
results [24, 28, 43]. In particular, Wang et al. [44] achieve
state-of-the-art performance using Fisher Vectors and dense
trajectories with motion stabilization. Driven by the success
of Convolutional Neural Networks (CNNs) for many recognition
tasks (image classification [25], object detection [9],
etc.), feature representations output by CNNs have been
extended to videos. Such approaches use 3D convolutions on
a stack of frames [17, 22, 40], apply recurrent neural networks
on per-frame features [4], or process images and optical
flows in two separate streams [37]. CNN representations
now achieve comparable results to space-time descriptors.

For temporal localization of actions, most state-of-the-art
approaches are based on a sliding window [7, 44]. To
speed up the localization, Oneata et al. [31] proposed an
approximately normalized Fisher Vector, which allows replacing
the sliding window by a more efficient branch-and-bound
search. The sliding-window paradigm is also common
for spatio-temporal action localization. For instance,
Tian et al. [39] extend the deformable part models, introduced
in [6] for object detection in 2D images, to 3D images
by using HOG3D descriptors [24] and employ a sliding-window
approach in scale, space and time. Wang et al. [45]
first use a temporal sliding window and then model the
relations between dynamic-poselets. Laptev and Perez [29]
perform a sliding window on cuboids, thus restricting the
action to have a fixed spatial extent across frames.

Another category of action localization approaches uses
a figure-centric model. Lan et al. [27] learn a spatio-temporal
model for an action using a figure-centric visual
word representation, where the location of the subject is
treated as a latent variable and is inferred jointly with the
action label. Prest et al. [34] propose to detect humans and
objects and then model their interaction. Human detectors
were also used by Klaser et al. [23] for action localization.
The detected humans are then tracked across frames
using optical flow and the track is classified using
HOG3D [24]. Our approach also relies on tracking, but is more
robust to appearance and pose changes by using a
tracking-by-detection approach [12, 21], in combination with a
class-specific detector. In addition, we classify the tracks using
per-frame CNN features and spatio-temporal features.

Some other methods are based on the generation of action
proposals [15, 46]. Yu and Yuan [46] compute an actionness
score and then use a greedy method to generate
proposals. Jain et al. [15] propose a method based on merging
a hierarchy of supervoxels. Ma et al. [30] leverage a
hierarchy of discriminative parts to represent and localize
an action. The extension of structured output learning from
object detection to action localization was proposed by Tran
and Yuan [41]. Recently, Gkioxari and Malik [11] proposed
to use object proposals for action localization. Object
proposals from SelectiveSearch [42] are detected in each
frame, scored using features from a two-stream CNN
architecture, and linked across the video. Our approach is
more robust since we do not force detections to pass through
proposals at every frame. Moreover, we combine the per-frame
CNN features with descriptors extracted at a spatio-temporal
level to capture the dynamics of the actions.

3. Overview of the approach 

Our approach for spatio-temporal action localization 
consists of four stages, see Figure 1. We now briefly present 
them and then provide a detailed description in Section 4. 

Extracting and scoring frame-level proposals. Our
method extracts a set of candidate regions at the frame level.
We use EdgeBoxes [47], as they obtain a high recall even
when considering relatively few proposals [13]. Each proposal
is represented with CNN features [11]. These CNN
features leverage both static and motion information and
are trained to discriminate the actions against background
regions. This is crucial since most of the proposals do not
contain any action. For each class, a hard negative mining
procedure is performed in order to train an action-specific
classifier. Given a test video, frame-level candidates are
scored with these action-specific classifiers.

Tracking best candidates. Given the frame-level candidates
of a video, we select the highest scoring ones per class
and track them throughout the video. Our tracking method
is based on a standard tracking-by-detection approach
leveraging an instance-level detector as well as a class-level
classifier. The detector is based on the same CNN features as the
first stage. We perform the tracking multiple times for each
action, starting from the proposal with the highest score that
does not overlap with previously computed tracks.

Scoring tracks. The CNN features only contain information
extracted at the frame level. Consequently, they are not
able to capture the dynamics of an action across multiple
frames. Thus, we introduce a spatio-temporal motion
histogram (STMH). It is inspired by the success of dense
trajectory descriptors [43]. Given a fixed-length chunk from
a track, we divide it into spatio-temporal cells and compute
a histogram of gradient, optical flow and motion boundaries
in each cell. Hard negative mining is employed to
learn a classifier for each class. The final score is obtained
by combining CNN and STMH classifiers.

Temporal localization. To detect the temporal extent of an
action, we use a multi-scale sliding window approach over
tracks. At test time, we rely on temporal windows of different
lengths that we slide with a stride of 10 frames over the
tracks. We score each temporal window according to CNN
features, STMH descriptor and a duration prior learned on
the training set. For each track, we then select the window
with the highest score.

Figure 1. Overview of our action localization approach. We detect frame-level object proposals and score them with CNN action classifiers.
The best candidates, in terms of scores, are tracked throughout the video. We then score the tracks with CNN and spatio-temporal motion
histogram (STMH) classifiers. Finally, we perform a temporal sliding window for detecting the temporal extent of the action.


4. Detailed description of the approach 

In this section, we detail the four stages of our action
localization approach. Given a video of $T$ frames $\{I_t\}_{t=1..T}$
and a class $c \in C$ ($C$ being the set of classes), the task
consists in detecting if the action $c$ appears in the video and if
yes, when and where. In other words, the approach outputs
a set of regions $\{R_t\}_{t=t_b..t_e}$ with $t_b$ (resp. $t_e$) the beginning
(resp. end) of the predicted temporal extent of the action $c$
and $R_t$ the detected region in frame $I_t$.


Figure 2. Illustration of CNN features for a region R. The CNN
features are the concatenation of the fc7 layer from the spatial-CNN
and motion-CNN, i.e., a 2x4096 dimensional descriptor.


4.1. Frame-level proposals with CNN classifiers 

Frame-level proposals. State-of-the-art methods [9] for
object localization replace the sliding-window paradigm
used in the past decade by object proposals. Instead of scanning
the image at every location and at several scales, object
proposals significantly reduce the number of candidate
regions and narrow down the set to regions that are
most likely to contain an object. For every frame, we extract
EdgeBoxes [47] using the online code and keep the
best 256 proposals according to the EdgeBox score. We denote
by $V_t$ the set of object proposals for a frame $I_t$. In
Section 4.2, we introduce a tracking approach that makes
our method robust to missing proposals.

CNN features. Recent work on action recognition [37] and
localization [11] has demonstrated the benefit of CNN feature
representations, applied separately on images and optical
flows. We use the same set of CNN features as in [11].

Given a region resized to 227 x 227 pixels, a spatial-CNN
operates on RGB channels and captures the static appearance
of the actor and the scene, while a motion-CNN
takes as input optical flow and captures motion patterns. The
optical flow signal is transformed into a 3-dimensional image
by stacking the x-component, the y-component and the
magnitude of the flow. Each image is then multiplied by
16 and converted to the closest integer between 0 and 255.
In practice, optical flow is estimated using the online code
from Brox et al. [2]. For a region $R$, the CNN features we
use are the concatenation of the fc7 layer (4096 dimensions)
from the spatial-CNN and motion-CNN, see Figure 2.
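The flow-to-image conversion described above can be sketched as follows. This is a minimal illustration that reads the text literally (scale by 16, round, clip to [0, 255]); the function names are hypothetical, and whether the original code adds an offset before clipping is not stated, so negative components simply saturate at 0 here.

```python
import math


def encode_flow_pixel(u, v, scale=16.0):
    """Map one flow vector (u, v) to three 8-bit channel values.

    The channels are the x-component, the y-component and the
    magnitude, each multiplied by `scale`, rounded to the closest
    integer and clipped to [0, 255].
    """
    magnitude = math.hypot(u, v)
    channels = []
    for value in (u, v, magnitude):
        quantized = int(round(value * scale))
        channels.append(min(255, max(0, quantized)))
    return tuple(channels)


def encode_flow(flow_u, flow_v, scale=16.0):
    """Encode two H x W nested lists of flow components as an H x W x 3 image."""
    return [
        [encode_flow_pixel(u, v, scale) for u, v in zip(row_u, row_v)]
        for row_u, row_v in zip(flow_u, flow_v)
    ]
```

For instance, a displacement of (3, 4) pixels has magnitude 5 and maps to the channel values (48, 64, 80).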

CNN training. We use the same architecture and training
procedure as [11]. We give a brief presentation below
and refer to their work for more details. The architecture
is the same for both networks, with 5 convolution layers
interleaved with pooling and normalization, and then 3 fully
connected layers interleaved with dropout. The last fully
connected layer (fc8) has $|C| + 1$ outputs, one per class
and an additional output for the background. Similar to [9],
during training, the proposals that overlap more than 50%
with the ground-truth are considered as positives, the others
as background. Regions are resized to fit the network size
(227 x 227) and randomly flipped. The spatial-CNN is
initialized with a model trained on full images from ImageNet
and fine-tuned for object detection on Pascal VOC 2012 [9].
For the motion-CNN, initialization weights are trained for
the task of action recognition on the UCF-101 dataset [38]
with full frames of the training set of split 1. We then
fine-tune the networks with back-propagation using Caffe [18]
on the proposal regions for each dataset. Each batch
contains 25% of non-background regions.
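The positive/background assignment used during training can be illustrated with a short sketch. It assumes that "overlap" means intersection over union between boxes given as (x1, y1, x2, y2); the helper names and the background label 0 are hypothetical choices for the example.

```python
def box_iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, gt_boxes, gt_labels, pos_thresh=0.5, background=0):
    """Give each proposal the label of its best-overlapping ground-truth
    box if the overlap exceeds pos_thresh, and the background label otherwise."""
    labels = []
    for proposal in proposals:
        best_iou, best_label = 0.0, background
        for gt_box, gt_label in zip(gt_boxes, gt_labels):
            iou = box_iou(proposal, gt_box)
            if iou > best_iou:
                best_iou, best_label = iou, gt_label
        labels.append(best_label if best_iou > pos_thresh else background)
    return labels
```

A proposal shifted by half its width relative to the ground-truth box has an IoU of 1/3 and is therefore treated as background.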

Action classifiers. For each action class $c \in C$, we train
a linear SVM using hard negative mining. The positives
are given by the ground-truth annotations and negatives by
all proposals whose overlap with a ground-truth region is
below 30%. At test time, we denote by $S_{\mathrm{CNN}}(c, R)$ the score
of a region $R$ for the action class $c$ given by the trained
classifier. This yields a confidence score for the region $R$
and an action class $c$.

4.2. Tracking 

The second stage consists in tracking the best proposals
over the video. We use a tracking-by-detection approach
that leverages instance-level and class-level detectors.
Let $R$ be a region in frame $I_\tau$ for the class $c$ to be
tracked. As a result, the tracking stage will output a track
$\mathcal{T}_c = \{R_t\}_{t=1..T}$. The track provides a candidate localization
for the action $c$. We first present how the tracker is
initialized. Then, we detail the tracking procedure. Finally,
we explain the selection of the regions to track.

Initialization. Given a region $R$ to be tracked in frame $I_\tau$,
the first step is to refine the position and size of the region
by performing a sliding-window search both in scale and
space in the neighborhood of $R$. Let $M(R)$ be the set of
windows scanned with a sliding window around the region
$R$. The best region according to the action-level classifier
is selected: $R_\tau = \arg\max_{r \in M(R)} S_{\mathrm{CNN}}(c, r)$. The
sliding-window procedure using CNN features can be performed
efficiently [10, 36].

Given the refined region, we train an instance-level
detector using a linear SVM. The set of negatives comprises
the instances extracted from boxes whose overlap with the
original region is less than 10%. The boxes are restricted
to regions in $V_\tau$, i.e., the proposals in frame $I_\tau$. The set
of positives is restricted to the refined region $R_\tau$. This
strategy is consistent with current tracking-by-detection
approaches [14]. Denote by $S_{\mathrm{inst}}(R)$ the score of the region
$R$ with the instance-level classifier. We now present how
the tracking proceeds over the video. We first do a forward
pass from frame $I_\tau$ to the last frame $I_T$, and then a
backward pass from frame $I_\tau$ to the first frame.

Update. Given a tracked region $R_t$ in frame $I_t$, we now
want to find the most likely location in frame $I_{t+1}$. We first
map the region $R_t$ into $R'_{t+1}$ by shifting the region with
the median of the flow between frames $I_t$ and $I_{t+1}$ inside the
region $R_t$. We then select the best region in the neighborhood
of $R'_{t+1}$ using a sliding window that leverages both
class-level and instance-level classifiers:

$R_{t+1} = \arg\max_{r \in M(R'_{t+1})} S_{\mathrm{inst}}(r) + S_{\mathrm{CNN}}(c, r)$. (1)

In addition, we update the instance-level classifier by
adding $R_{t+1}$ as a positive exemplar and proposals
from frame $I_{t+1}$ that do not overlap with this region as
negatives. Note that at each classifier update, we restrict the set
of negatives to the hard negatives.

Algorithm 1 Tracking

Input: a region $R$ in frame $I_\tau$ to track, a class $c$
Output: a track $\mathcal{T}_c = \{R_t\}_{t=1..T}$

$R_\tau \leftarrow \arg\max_{r \in M(R)} S_{\mathrm{CNN}}(c, r)$
Pos $\leftarrow \{R_\tau\}$
Neg $\leftarrow \{r \in V_\tau \mid \mathrm{IoU}(r, R_\tau) < 0.1\}$
For $i = \tau + 1 \ldots T$ and $i = \tau - 1 \ldots 1$:
    Learn instance-level classifier from Pos and Neg
    $R_i \leftarrow \arg\max_{r \in M(R'_i)} (S_{\mathrm{CNN}}(c, r) + S_{\mathrm{inst}}(r))$
    Neg $\leftarrow$ Neg $\cup \{r \in V_i \mid \mathrm{IoU}(r, R_i) < 0.1\}$
    Neg $\leftarrow \{r \in \mathrm{Neg} \mid S_{\mathrm{inst}}(r) > -1\}$ (restrict to hard negatives)
    Pos $\leftarrow$ Pos $\cup \{R_i\}$


The tracking algorithm is summarized in Algorithm 1.
By combining instance-level and class-level information,
our tracker is robust to significant changes in appearance
and occlusion. Note that category-specific detectors were
previously used in other contexts, such as face [20] or
people [8] tracking. We demonstrate the benefit of such
detectors in our experiments in Section 5.
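The first step of the tracker update, shifting the current region by the median optical flow inside it, can be sketched as below. This is only an illustration: the function name, the nested-list flow layout and the exclusive integer box convention are assumptions made for the example, and the subsequent sliding-window search with the two classifiers is not shown.

```python
from statistics import median


def shift_box_by_median_flow(box, flow_u, flow_v):
    """Shift a box (x1, y1, x2, y2) by the median flow inside it.

    flow_u and flow_v are H x W nested lists holding the horizontal and
    vertical displacement from frame t to frame t+1. Box coordinates are
    integer pixel indices with x2 and y2 exclusive (assumed convention);
    the box must cover at least one pixel.
    """
    x1, y1, x2, y2 = box
    us = [flow_u[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    vs = [flow_v[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    du, dv = median(us), median(vs)
    return (x1 + du, y1 + dv, x2 + du, y2 + dv)
```

Using the median rather than the mean keeps the shift stable when part of the box covers background pixels that move differently from the actor.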

Proposals selection. We now present how we choose the
proposals to track. We first select the subset of classes
for which the tracking is performed. To this end, we
assign a score to the video for each class $c \in C$ and
keep the top-5. The score for a class $c$ is defined as
$\max_{r \in V_t, t=1..T} S_{\mathrm{CNN}}(c, r)$, i.e., we keep the maximum
score for $c$ over all proposals of the video.

When generating tracks for the class c, we first select 
the proposal with the highest score over the entire video. 
We run the tracker starting from this region and obtain a 
first track. We then perform the tracking iteratively, starting 
a new track from the best proposal that does not overlap 
with any previous track from the same class. In practice, 
we compute 2 tracks for each selected class. 
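The class selection step can be written in a few lines. The sketch below assumes the per-proposal scores of a video are already available as a dictionary mapping each class to the list of its proposal scores over all frames; this data layout and the function name are hypothetical.

```python
def select_top_classes(scores_per_class, k=5):
    """Rank classes by their maximum proposal score over the video
    and return the k best ones, best first."""
    video_scores = {
        label: max(scores)
        for label, scores in scores_per_class.items()
        if scores
    }
    ranked = sorted(video_scores, key=video_scores.get, reverse=True)
    return ranked[:k]
```

Tracking is then run only for the returned classes, which is what keeps the cost of the tracking stage independent of the total number of classes.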

4.3. Track descriptor 

So far, we have only used features extracted on individual
frames. Clearly, this does not capture the dynamics of
the action over time. To overcome this issue, we introduce
a spatio-temporal motion histogram (STMH) feature.

Figure 3. Illustration of STMH. A chunk is split into
spatio-temporal cells for which a histogram of gradient, optical flow
and motion boundaries is computed.

The STMH descriptor. Similar to Wang et al. [43],
we rely on histograms of gradient and motion extracted in
spatio-temporal cells. Given a track $\mathcal{T}_c = \{R_t\}_{t=1..T}$, it
is divided into temporal chunks of $L = 15$ frames, with a
chunk starting every 5 frames. Each chunk is then divided
into $N_t$ temporal cells, and each region $R_t$ into $N_s \times N_s$
spatial cells, as shown in Figure 3. For each spatio-temporal
cell, we perform a quantization of the per-pixel image gradient
into a histogram of gradients (HOG) with 8 orientations.
The histogram is then normalized with the L2-norm.
Similarly, we compute HOF, MBHx and MBHy by replacing
the image gradient by the optical flow and the gradient
of its x and y components. For HOF, a bin for an almost
zero value is added, with a threshold at 0.04. In practice,
we use 3 temporal cells and 8x8 spatial cells, resulting in
$3 \times 8 \times 8 \times (8 + 9 + 8 + 8) = 6336$ dimensions. Note that
we use more spatial cells than [43], as our regions are on
average significantly larger than the 32 x 32 patch they use.
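The descriptor size and the per-cell histogram can be illustrated with the sketch below. The dimension formula follows the text directly. The histogram function is only one plausible reading: hard assignment to orientation bins, magnitude weighting and a count-based zero bin are assumptions of this example, not details given in the text.

```python
import math


def stmh_dimension(n_t, n_s, hog_bins=8, hof_bins=9, mbh_bins=8):
    """Number of dimensions for n_t temporal and n_s x n_s spatial cells
    (HOG + HOF + MBHx + MBHy bins per cell)."""
    return n_t * n_s * n_s * (hog_bins + hof_bins + 2 * mbh_bins)


def orientation_histogram(vectors, n_bins=8, zero_thresh=None):
    """L2-normalized orientation histogram of 2D vectors for one cell.

    Each vector votes for one orientation bin with a weight equal to its
    magnitude (assumed weighting). If zero_thresh is given, vectors whose
    magnitude is below it are counted in an extra last bin, as for HOF.
    """
    hist = [0.0] * (n_bins + (1 if zero_thresh is not None else 0))
    for dx, dy in vectors:
        magnitude = math.hypot(dx, dy)
        if zero_thresh is not None and magnitude < zero_thresh:
            hist[-1] += 1.0
            continue
        angle = math.atan2(dy, dx) % (2 * math.pi)
        bin_index = int(angle / (2 * math.pi) * n_bins) % n_bins
        hist[bin_index] += magnitude
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm > 0 else hist
```

With 3 temporal cells and 8 x 8 spatial cells, `stmh_dimension(3, 8)` gives the 6336 dimensions quoted above.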
Fusion. For each action, we train a linear SVM using hard
negative mining. The set of positives is given by features
extracted along the ground-truth annotations, while the
negatives are given by cuboids (spatially and temporally)
centered at the proposals that do not overlap with the
ground-truth. Let $S_{\mathrm{desc}}(c, \mathcal{T})$ be the average of the scores for all the
chunks of length $L$ from the track $\mathcal{T}$ for the action $c$.

Given a track $\mathcal{T} = \{R_t\}_{t=1..T}$, we score it by summing
the scores from the CNN averaged over all frames, and the
scores from the descriptors averaged over chunks:

$S(\mathcal{T}) = \sigma(S_{\mathrm{desc}}(c, \mathcal{T})) + \sigma\left(\frac{1}{T} \sum_{t=1}^{T} S_{\mathrm{CNN}}(c, R_t)\right)$, (2)

where $\sigma(x) = 1/(1 + e^{-x})$. We summarize the resulting
approach for spatio-temporal detection in Algorithm 2.
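The track score of Eq. 2 can be sketched as follows, assuming the per-chunk STMH classifier scores and the per-frame CNN classifier scores of a track are given as plain lists. The function names are hypothetical, and the averaging of the CNN scores follows the description in the text.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def track_score(chunk_scores, frame_scores):
    """Sum of the sigmoid of the mean STMH chunk score and the
    sigmoid of the mean per-frame CNN score of a track."""
    s_desc = sum(chunk_scores) / len(chunk_scores)
    s_cnn = sum(frame_scores) / len(frame_scores)
    return sigmoid(s_desc) + sigmoid(s_cnn)
```

Passing both terms through a sigmoid puts the two SVM outputs on a comparable scale before they are added, so the result always lies between 0 and 2.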

4.4. Temporal localization 

Similar to the winning approach in the temporal action
detection track of the Thumos 2014 challenge [32], we use
a sliding-window strategy for temporal localization. However,
we apply the sliding window directly on each track $\mathcal{T}$,
while [32] used features extracted for the full frames. The
window length takes values of 20, 30, 40, 50, 60, 70, 80,
90, 100, 150, 300, 450 and 600 frames. The sliding window
has a stride of 10 frames. For each action $c$, we learn the
frequency of its durations on the training set. We score each
window using the score described above based on CNN
features and STMH, normalized with a sigmoid, and multiply
it with the per-class duration prior. For each track, we
keep the top-scoring window as the spatio-temporal detection.
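The multi-scale sliding window over a track can be sketched as below. The window lengths and the stride are those given in the text; `window_score` and `duration_prior` are hypothetical callables standing in for the sigmoid-normalized CNN/STMH score of a window and the learned per-class duration prior.

```python
WINDOW_LENGTHS = (20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 300, 450, 600)


def temporal_windows(track_length, lengths=WINDOW_LENGTHS, stride=10):
    """Yield (start, end) frame indices (end exclusive) of all windows
    that fit inside a track of track_length frames."""
    for length in lengths:
        for start in range(0, track_length - length + 1, stride):
            yield start, start + length


def best_window(track_length, window_score, duration_prior,
                lengths=WINDOW_LENGTHS, stride=10):
    """Return (score, start, end) of the top-scoring window, or None
    if the track is shorter than every window length."""
    best = None
    for start, end in temporal_windows(track_length, lengths, stride):
        score = window_score(start, end) * duration_prior(end - start)
        if best is None or score > best[0]:
            best = (score, start, end)
    return best
```

Because the prior multiplies the score, a window length that never occurs for a class in the training set cannot be selected for that class unless all alternatives also score zero.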


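Algorithm 2 Spatio-temporal detection in a test video

Input: a test video $\{I_t\}_{t=1..T}$
Output: a list of detections $(c, \mathcal{T}, \mathrm{score})$

For $t = 1..T$
    $V_t \leftarrow$ EdgeBoxes($I_t$)
    For $r \in V_t$
        Compute $S_{\mathrm{CNN}}(c, r)$
$C' \leftarrow$ class selection (see Sec. 4.2)
Detections $\leftarrow$ [ ]
For $c \in C'$
    For $i = 1..n_{\mathrm{tracks}}$ (we generate $n_{\mathrm{tracks}} = 2$ tracks per label, see Sec. 4.2)
        $R, \tau \leftarrow \arg\max_{r \in V_t, t=1..T} S_{\mathrm{CNN}}(c, r)$
            (proposal to track without overlap with previous tracks)
        $\mathcal{T} \leftarrow$ Tracking($R, \tau, c$) (Algorithm 1)
        score $\leftarrow \sigma(S_{\mathrm{STMH}}(c, \mathcal{T})) + \sigma\left(\frac{1}{T} \sum_{t=1}^{T} S_{\mathrm{CNN}}(c, R_t)\right)$ (Eq. 2)
        Detections $\leftarrow$ Detections $\cup \{(c, \mathcal{T}, \mathrm{score})\}$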


5. Experimental results 

In this section, we first present the datasets and the
evaluation protocol. We then study the impact of both the
tracking and the class selection, and provide a parametric study
of STMH. Finally, we show that our approach outperforms
the state of the art for spatio-temporal action localization.

5.1. Datasets and evaluation

In our experiments, we use three datasets: UCF-Sports,
J-HMDB and UCF-101.

UCF-Sports [35]. The dataset contains 150 short videos
of sports broadcasts from 10 action classes: diving, golf
swinging, kicking, lifting, horse riding, running, skating,
swinging on the pommel horse and on the floor, swinging at
the high bar and walking. Videos are truncated to the action
and bounding box annotations are provided for all frames.
We use the standard training and test split defined in [27].

J-HMDB [16]. The dataset is a subset of the HMDB
dataset [26]. It consists of 928 videos for 21 different
actions such as brush hair, swing baseball or jump. Video
clips are restricted to the duration of the action. Each clip
contains between 15 and 40 frames. Human silhouettes are
annotated for all frames. Thus, the dataset can be used for
evaluating action localization. There are 3 train/test splits
and evaluation averages the results over the three splits.

UCF-101 [38]. The dataset is dedicated to action
classification with more than 13000 videos and 101 classes.
For a subset of 24 labels, the spatio-temporal extents of
the actions are annotated. All experiments are performed
on the first split only. In contrast to UCF-Sports and
J-HMDB where the videos are truncated to the action,
UCF-101 videos are longer and the localization is both spatial and
temporal. Figure 4 shows a histogram of the action
durations in the training set, averaged over all 24 classes. Some
of the actions are long, such as ‘soccer juggling’ or ‘ice
dancing’, whereas others last only a few frames, e.g. ‘tennis
swing’ or ‘basketball dunk’.

Figure 4. Histogram of action durations for the 24 classes with
spatio-temporal annotations in the UCF-101 dataset (training set).

                       recall-track              mAP
detectors              UCF-Sports   J-HMDB       UCF-Sports   J-HMDB
instance-level only    85.42%       94.59%       74.27%       54.32%
class-level only       92.92%       81.28%       85.67%       53.25%

Table 1. Impact of the detectors used in the tracker. We
measure if the tracks generated for the ground-truth label cover the
ground-truth tracks (recall-track). We also measure the impact of
the tracker on the final detection performance (mAP). The
experiments are done on UCF-Sports and J-HMDB (split 1 only).

Evaluation metrics. A detection is considered “correct” if
the intersection over union (IoU) with the ground-truth is
above a threshold $\delta$. The IoU between two tracks is defined
as the IoU over the temporal domain, multiplied by the
IoU between boxes averaged over all overlapping
frames. Duplicate detections are considered as “incorrect”.
By default, the reported metric is the mean Average Precision
at threshold $\delta = 50\%$ for spatial localization (UCF-Sports
and J-HMDB) and $\delta = 20\%$ for spatio-temporal
localization (UCF-101). When comparing to the state of the
art on UCF-Sports, we also use ROC curves and report the
Area Under the Curve (AUC) as done by previous work.
Note that this metric is impacted by the set of negative
detections and, thus, may not be suited for a detection task [5].
Indeed, if one adds many easy negatives, i.e., negatives that
are ranked after all positives, the AUC increases while the
mAP remains the same.
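The track-level IoU defined above can be sketched as follows, with each track represented as a dictionary from frame index to box (x1, y1, x2, y2). This representation and the function names are assumptions made for the example.

```python
def box_iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def track_iou(track_a, track_b):
    """Temporal IoU of the two frame sets, multiplied by the box IoU
    averaged over the frames present in both tracks."""
    frames_a, frames_b = set(track_a), set(track_b)
    common = frames_a & frames_b
    if not common:
        return 0.0
    temporal_iou = len(common) / len(frames_a | frames_b)
    spatial_iou = sum(box_iou(track_a[t], track_b[t]) for t in common) / len(common)
    return temporal_iou * spatial_iou
```

On datasets where videos are trimmed to the action, a detection spanning the whole clip has a temporal IoU of 1 and the measure reduces to the average spatial IoU.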

5.2. Impact of the tracker 

The strength of our approach lies in the combination of
class-specific and instance-level detectors in the tracker. To
measure the benefit of this combination, Table 1 compares
the performance when removing one of them.
‘Recall-tracks’ measures if at least one of the 2 generated tracks for
the ground-truth action covers the ground-truth annotations
(IoU > 0.5), i.e., it measures the recall at the track level. We
also measure the impact on the final detection performance
(mAP) by running our full pipeline with each tracker.

On UCF-Sports, tracking obtained by combining the
detectors leads to the highest recall. Using the instance-level
detector only significantly degrades the recall by 13%. This can
be explained by the abrupt changes in pose and appearance
for actions such as diving or swinging. On the other hand,
the instance-level detector performs well on the J-HMDB
dataset, which contains more static actions.

                        without STMH            with STMH
proposals               Linking    Tracking     Linking    Tracking
SelectiveSearch [42]    75.94%     83.77%       77.1%      84.9%
EdgeBoxes-256 [47]      79.89%     88.23%       83.2%      90.5%

Table 2. Comparison of tracking and linking, SelectiveSearch
and EdgeBoxes-256 proposals without and with STMH on
UCF-Sports (localization in mAP).

Combining instance-level and class-specific classifiers
also gives the best performance in terms of final detection
results. On UCF-Sports, this is mainly due to the higher
recall. On J-HMDB, we find that using the instance-level
detector only leads to a better recall but the precision
decreases because there are more tracks from an incorrect
label that have a high score.

Table 2 compares the localization mAP on UCF-Sports
when using our proposed tracker or a linking strategy
as in [11]. We experiment with proposals from
SelectiveSearch [42] (approximately 2700 proposals per frame)
or EdgeBoxes [47] (top-256), with or without STMH. We
can see that using EdgeBoxes instead of SelectiveSearch
leads to a gain of 6% when using STMH. Using a tracking
strategy leads to a further gain of 7%, with in addition a
more refined localization, see Figure 6. This shows that the
tracker is a key component of the success of our approach.

5.3. Class selection 

We now study the impact of selecting the top-5 classes
based on the maximum score over all proposals from a
video for a given class, see Section 4.2. We measure the
percentage of cases where the correct label is in the top-k
classes and show the results in Figure 5 (blue curve). Most
of the time, the correct class has the highest maximum score
(around 85% on UCF-Sports and 61% on J-HMDB). If we
use top-5, we misclassify less than 10% of the videos on
J-HMDB, and 0% on UCF-Sports.

Figure 5 also shows that recall (green) is lower than the
top-k accuracy because the generated tracks might not have
a sufficient overlap with the ground-truth due to a failure of
the tracker. The difference between recall and top-k
accuracy is more important for large k. This can be explained
by the fact that the class-level detector performs poorly for
videos where the correct label has a low rank, therefore the
class-specific tracker performs poorly as well.

In addition, we display in red the evolution of the mAP 
on UCE-Sports and J-HMDB (split 1 only) when changing 
the number of selected classes. Initially, the performance 
significantly increases as this corrects the cases where the 
correct label is top-k but not first, i.e., the recall increases. 
The performance then saturates since, even in the case 
where a new correct label is tracked over a video, the fi¬ 
nal score will be low and will not have an important impact 
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Number of selected classes Number of selected classes 


Figure 5. Impact of the class selection on UCF-Sports {left) and J- 
HMDB {right) datasets. In blue, top-k accuracy is shown, Le., the 
percentage of cases where the correct label is in the top-k classes. 
The recall when changing the number of selected classes is shown 
in green and the mAP in red. 


Nt 

Ns 

dimension 

UCF-Sports 

J-HMDB 


2 

132 

76.07% 

38.78% 

1 

4 

528 

80.00% 

48.58% 

1 

8 

2112 

82.50% 

51.71% 


16 

8448 

81.67% 

49.54% 


2 

264 

77.98% 

41.21% 


4 

1056 

80.00% 

49.41% 

Z 

8 

4224 

87.50% 

52.72% 


16 

16896 

82.50% 

48.89% 


2 

396 

82.74% 

41.38% 

Q 

4 

1584 

83.33% 

50.52% 

J 

8 

6336 

87.50% 

54.26% 


16 

25344 

84.17% 

47.98% 


2 

660 

79.64% 

41.51% 

C 

4 

2640 

80.00% 

50.84% 

D 

8 

10560 

88.33% 

52.11% 


16 

42240 

84.17% 

47.81% 


Table 3. Comparison of mean-Accuracy when classifying ground- 
truth tracks using STMH with different numbers of temporal {Nt) 
and spatial {Ns) cells. 



Method 

mAP 

[11] 

75.8 

no STMH 

ours 

88.2 

90.5 


Figure 6. Comparison to the state of the art on UCF-Sports. Left: 
AUC for varying loU thresholds. Right: mAP at (5 = 50%. ‘no 
STMH’ refers to our method without rescoring based on STMH. 



Figure 7. Example results from the UCF-Sports dataset. 


on the precision. In summary, selecting the top-k classes
performs similarly to keeping all classes while significantly
reducing the computational time.

5.4. STMH parameters

We now study the impact of the number of temporal and
spatial cells in STMH. For evaluation, we consider the
classification task and learn a linear SVM on the descriptors
extracted from the ground-truth annotations of the training
set. We then predict the label on the test set, assuming the
ground-truth localization is known, and report mean
Accuracy. Results are shown in Table 3. We can see that the best
performance is obtained with Ns = 8 spatial cells on both
datasets, independently of the number of temporal cells Nt.
By increasing the number of cells to a higher value, e.g. 16,
the descriptor becomes too specific for a class. When using
a unique temporal cell, i.e., Nt = 1, the performance is
significantly worse than for Nt = 3. We choose Ns = 8 and
Nt = 3 in the remainder of the experiments. The resulting
STMH descriptor has 6,336 dimensions.

Using the same protocol, we obtain a performance of
91.9% for UCF-Sports and 57.99% for J-HMDB using
state-of-the-art improved dense trajectories [44] with Fisher
Vector encoding (256 GMMs) and a Hellinger kernel. Note
that the resulting representation has 100k dimensions, i.e., it is
significantly higher dimensional. Furthermore, STMH is an
order of magnitude faster to extract.

5.5. Comparison to the state of the art


In this section, we compare our approach to the state
of the art. On UCF-Sports, past work usually gives ROC
curves and reports the Area Under the Curve (AUC). Figure 6
(left) shows a comparison with the state of the art using
the same protocol for different IoU thresholds δ. We can
observe that our approach outperforms the state of the art.
Note that at a low threshold, all methods obtain comparable
performance, but the gap widens for larger ones, i.e., for
more precise detections. Indeed, our spatial localization
enjoys a high precision thanks to the tracking: the position of
the detected region is refined in each frame using a sliding
window. As a consequence, the IoU between our detected
tracks and the ground truth is high, explaining why our
performance remains constant between a low threshold and a
high threshold δ. Figure 7 shows example results. Despite
important changes in appearance, the actor is successfully
tracked throughout the video. For detection, mAP is more
suitable as it does not depend on negatives. Results are
shown in Figure 6 (right). We outperform the state of the
art with a margin of 15% and obtain a mAP of 90.5%. We
also compute the mAP when scoring without STMH
classifiers, i.e., the score is based on CNN features only, and
observe a drop of 2%.
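
To make the role of the IoU threshold δ concrete, the following sketch scores a detected track against a ground-truth track by averaging the per-frame box IoU, and accepts the detection when this value reaches δ. Averaging over the frames shared by both tracks, the `(x1, y1, x2, y2)` box format and all function names are assumptions of this sketch; the exact overlap criterion used for evaluation is not restated in this section.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def track_iou(detected, ground_truth):
    """Mean per-frame IoU over the frames present in both tracks.

    Each track is a dict mapping a frame index to a box.
    """
    common = sorted(set(detected) & set(ground_truth))
    if not common:
        return 0.0
    return sum(box_iou(detected[f], ground_truth[f]) for f in common) / len(common)


def is_correct(detected, ground_truth, delta=0.5):
    """A detection counts as correct if its track IoU reaches the threshold delta."""
    return track_iou(detected, ground_truth) >= delta


# Toy two-frame example: perfect overlap in frame 0, half-shifted box in frame 1.
gt = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10)}
det = {0: (0, 0, 10, 10), 1: (5, 0, 15, 10)}
overlap = track_iou(det, gt)
```

In the toy example the per-frame IoUs are 1 and 1/3, so the track IoU is 2/3: the detection is accepted at δ = 0.5 and rejected at δ = 0.7. A tracker that refines the box in every frame keeps this average high, which is why the reported performance changes little as δ grows.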
δ         0.2          0.3          0.4          0.5
[11]      -            -            -            53.3
no STMH   58.1 ± 2.1   58.0 ± 1.9   57.7 ± 2.1   56.5 ± 2.6
ours      63.1 ± 1.8   63.5 ± 1.8   62.2 ± 1.9   60.7 ± 2.7

Table 4. Comparison to the state of the art on J-HMDB using mAP
for varying IoU thresholds δ. We also report the standard deviation
among the splits.


δ      0.05    0.1     0.2     0.3
[46]   42.8    -       -       -
ours   54.28   51.68   46.77   37.82

Table 5. Localization results (mAP) on UCF-101 (split 1) for
different IoU thresholds δ.









Figure 9. Example results from the UCF-101 dataset. 


The results for the J-HMDB dataset are given in Table 4.
We also outperform the state of the art by more than 7%
on J-HMDB at a standard threshold δ = 0.5. In particular,
adding STMH leads to an improvement of 4%. We can
also see that the mAP is stable w.r.t. the threshold δ. This
highlights once again the high precision of the spatial
detections, i.e., they all have a high overlap with the ground truth,
thanks to the tracking.
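
For reference, the sketch below shows one common way to compute average precision for a single class from ranked detections, and mAP as the mean over classes. It assumes that each detection has already been marked as a true or false positive at a given threshold δ, and it uses the non-interpolated definition (mean of the precision at each true positive, divided by the number of ground-truth instances). The exact AP variant used for the reported numbers is not specified in this section, and the function names are hypothetical.

```python
def average_precision(scores, is_true_positive, n_ground_truth):
    """Non-interpolated AP for one class.

    scores: detection confidences.
    is_true_positive: one flag per detection, decided beforehand at some IoU threshold.
    n_ground_truth: number of ground-truth instances of the class.
    """
    if n_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if is_true_positive[i]:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / n_ground_truth


def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class APs."""
    return sum(per_class_ap) / len(per_class_ap)


# Toy example: 4 ranked detections, 2 ground-truth instances.
ap = average_precision(
    scores=[0.9, 0.8, 0.6, 0.3],
    is_true_positive=[True, False, True, False],
    n_ground_truth=2,
)
```

Only the ranked detections and the number of ground-truth instances enter the computation, so the measure is unaffected by how many negative windows or videos the dataset contains.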

Finally, we report the results for spatio-temporal
detection on the UCF-101 dataset in Table 5. We obtain a mAP
of close to 47% at a standard threshold δ = 20%
despite the challenge of detecting an action both spatially and
temporally. At a threshold δ = 5%, we obtain a mAP of
54.3% compared to 42.8% reported by [46]. Figures 8 and 9
show example results. We can observe that the result for
the action “Basketball” is precise both in space and time.
While most of the 24 action classes cover almost the entire
video, i.e., there is no need for temporal localization, the
action “Basketball” covers on average one fourth of the video,
i.e., it has the shortest relative duration in UCF-101. For
this class our temporal localization approach improves the
performance significantly. The AP for Basketball is 28.6%
(δ = 20%) with our full approach. If we remove the
temporal localization step, the performance drops to 9.63%. This
shows that our approach is capable of localizing actions in
untrimmed videos. With respect to tracking in untrimmed
videos, tracking starts from the highest scoring proposal in
both directions (forward and backward) and continues even
if the action is no longer present. The temporal sliding
window can then localize the action and remove the parts
without the action. Future work includes designing datasets for
spatio-temporal localization in untrimmed videos in order
to evaluate temporal localization more thoroughly.
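
The temporal localization step can be pictured with the following sketch, which slides windows of several lengths over the per-frame scores of one track and keeps the window with the highest mean score. This section does not specify how windows are scored, so the mean per-frame score is used as a stand-in; the window lengths, the stride and the function name are likewise assumptions of this sketch.

```python
def best_temporal_window(frame_scores, window_lengths=(20, 40, 80), stride=5):
    """Return (start, end, score) of the best-scoring window; end is exclusive.

    frame_scores: one classifier score per frame of the track.
    Returns None for an empty track.
    """
    n = len(frame_scores)
    if n == 0:
        return None
    # Prefix sums give the mean score of any window in constant time.
    prefix = [0.0]
    for s in frame_scores:
        prefix.append(prefix[-1] + s)
    best = None
    for length in window_lengths:
        length = min(length, n)  # a window cannot be longer than the track
        for start in range(0, n - length + 1, stride):
            end = start + length
            score = (prefix[end] - prefix[start]) / length
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Toy track of 100 frames where the action is present in frames 30 to 49.
frame_scores = [-1.0] * 30 + [1.0] * 20 + [-1.0] * 50
window = best_temporal_window(frame_scores)
```

On the toy track the selected window is frames 30 to 49, i.e., the frames before and after the action are discarded even though the tracker kept following the actor through them.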

6. Conclusion 

We present an effective approach for action localization
that detects actions in space and time. Our approach builds
upon object proposals extracted at the frame level that we
track throughout the video. Tracking is effective, as we
combine instance-level and class-level detectors. The
resulting tracks are scored by combining classifiers learned
on CNN features and our proposed spatio-temporal
descriptors. A sliding window finally performs the temporal
localization of the action. The proposed approach improves on
the state of the art by a margin of 15% in mAP on
UCF-Sports, 7% on J-HMDB and 12% on UCF-101.